
LLM Evaluation

Method

Use a dataset to evaluate LLMs.
The dataset consists of math problems.
Paper: DeepSeek-R1
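
A minimal sketch of what "evaluate an LLM on a math dataset" can look like in practice: extract the final number from each model output and compare it with the reference answer. The names `extract_final_number`, `accuracy`, `toy_dataset` and `toy_model` are hypothetical and only for illustration. `toy_model` is a stand-in for a real LLM call, and exact-match scoring is an assumption here, not the protocol of any specific paper.

```python
import re
from typing import Callable, Iterable, Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number found in the model output, or None."""
    # Drop thousands separators so "1,234" is read as "1234".
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def accuracy(model: Callable[[str], str],
             dataset: Iterable[tuple[str, str]]) -> float:
    """Fraction of problems whose extracted answer equals the reference."""
    items = list(dataset)
    if not items:
        return 0.0
    correct = sum(extract_final_number(model(q)) == a for q, a in items)
    return correct / len(items)


# Toy data, invented for this example.
toy_dataset = [
    ("Tom has 3 apples and buys 4 more. How many apples does he have?", "7"),
    ("A box holds 12 eggs. How many eggs are in 3 boxes?", "36"),
]


def toy_model(question: str) -> str:
    # Stand-in for a real LLM call: it always gives the same answer.
    return "The answer is 7."


print(accuracy(toy_model, toy_dataset))  # 0.5
```

The single accuracy number is what usually gets reported, which is why the questions in the next section matter.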

Problem

How many of the problems has the LLM simply memorized?

Paper: GSM-Symbolic
Paper: Premise Order Matters in Reasoning with Large Language Models
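
One way to probe memorization, sketched loosely after the ideas in the two papers above: turn a problem into a template, generate variants with different names and numbers, and reorder the premises. A model that only memorized the original wording should score well on the original and poorly on the variants. Everything here (`TEMPLATE`, `make_variant`, `shuffle_premises`, `memorizing_model`) is a hypothetical illustration, not code from either paper.

```python
import random

TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have?")


def make_variant(rng: random.Random) -> tuple[str, str]:
    """Fill the template with a new name and new numbers; return (question, answer)."""
    name = rng.choice(["Tom", "Ava", "Ravi", "Mei"])
    # Numbers start at 10 so no variant can reproduce the original (3 and 4).
    a, b = rng.randint(10, 50), rng.randint(10, 50)
    return TEMPLATE.format(name=name, a=a, b=b), str(a + b)


def shuffle_premises(premises: list[str], question: str,
                     rng: random.Random) -> str:
    """Reorder the premises but keep the question at the end."""
    shuffled = premises[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled + [question])


ORIGINAL = (TEMPLATE.format(name="Tom", a=3, b=4), "7")


def memorizing_model(question: str) -> str:
    # Stand-in for an LLM that has only memorized the original problem.
    return {ORIGINAL[0]: ORIGINAL[1]}.get(question, "unknown")


def accuracy(model, items) -> float:
    """Exact-match accuracy over (question, answer) pairs."""
    return sum(model(q) == a for q, a in items) / len(items)


rng = random.Random(0)
variants = [make_variant(rng) for _ in range(20)]

print(accuracy(memorizing_model, [ORIGINAL]))  # 1.0
print(accuracy(memorizing_model, variants))    # 0.0

# Same facts, different order: the correct answer does not change.
print(shuffle_premises(
    ["Tom has 3 apples.", "Tom buys 4 more apples."],
    "How many apples does Tom have?",
    rng,
))
```

The gap between the two accuracy numbers is the quantity of interest: the larger it is, the more the original score reflects memory rather than reasoning.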

If we change the dataset, is that good enough?

Paper: ARC-AGI

Conclusion - Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure"

Method
Problem
- How many of the problems has the LLM simply memorized?
- If we change the dataset, is that good enough?
Conclusion - Goodhart's Law